\(\;\)

\(\;\)

1 Exercises

1.1 Prediction of Chicago Air Pollution

To explore the nature of the predictive accuracy of various polynomials, we will use a subset of the o3median attribute from the Chicago air pollution data in the R package gamair. \(O_3\), also known as ozone or trioxygen, is a strong oxidant with a chlorine-like odour that can cause respiratory damage if inhaled. Hence, \(O_3\) is an air pollutant when present where humans live.

library("gamair")
data(chicago)

n <- 500
x <- 1:n
o3.data <- data.frame(x=1:n, y=chicago$o3median[1:n])

head(chicago) # first few lines of the data set
##   death pm10median pm25median  o3median  so2median    time tmpd
## 1   130 -7.4335443         NA -19.59234  1.9280426 -2556.5 31.5
## 2   150         NA         NA -19.03861 -0.9855631 -2555.5 33.0
## 3   101 -0.8265306         NA -20.21734 -1.8914161 -2554.5 33.0
## 4   135  5.5664557         NA -19.67567  6.1393413 -2553.5 29.0
## 5   126         NA         NA -19.21734  2.2784649 -2552.5 32.0
## 6   130  6.5664557         NA -17.63400  9.8585839 -2551.5 40.0
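The exercises below rely on helper functions from the course notes (getmuhat, getmuFun, var_mutilde, bias2_mutilde, apse_all) that are not all defined in this document. As a reference point, here is a minimal sketch of what getmuhat could look like; this is an assumption about its behaviour (fit a degree-d polynomial by least squares and return it as a prediction function), not the official implementation:

```r
# Hedged sketch of getmuhat: least-squares polynomial fit of a chosen degree,
# returned as a function of x (degree 0 is just the mean of y).
getmuhat <- function(sampleXY, complexity = 1) {
  fit <- if (complexity == 0) lm(y ~ 1, data = sampleXY)
         else lm(y ~ poly(x, degree = complexity), data = sampleXY)
  function(x) predict(fit, newdata = data.frame(x = x))
}
```

For example, getmuhat(o3.data, 2) would return a function that can be drawn with curve(), which is how it is used throughout the solutions.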
  1. Generate the scatter plot of the data and overlay fitted polynomials with degrees 2 and 10 to the data.
  2. Generate \(m=100\) samples of size \(n=20\). Fit polynomials of degree 2 and 10 to every sample.
  3. Using par(mfrow=c(1,2)) plot all the fitted polynomials with degree 2 and 10 on two different figures. Overlay the two fitted polynomials of degree 2 and 10 based on the whole population (make the colour of the population curves different from the others to make them stand out).
  4. Using the var_mutilde function, calculate the sampling variability of the fitted polynomials with degrees equal to 2 and 10.
  5. Using the bias2_mutilde function, calculate the squared bias of the polynomials with degrees equal to 2 and 10.
  6. Generate \(m=100\) samples of size \(n=40\), and using the apse_all function, calculate the APSE for complexities equal to 0:10.
    • Summarize the results with a table and a graphical display.
    • Give a conclusion.

\(\;\)

\(\;\)

\(\;\)

1.2 Fake Data One

Bias-variance trade-offs: To explore the nature of the predictive accuracy of various polynomials, we construct a number of functions that allow us to generate data from the same response model and use these to assess the bias-variance trade-off of our estimators.

In this question, you are going to do the same, but now using the same mechanism used to generate the fake data used in the notes for illustrating different fits.

  • The fake data was constructed as follows:

    # Get some x's
    set.seed(341)
    N <- 125
    x <- runif(N, 0, 1)
    x = rep(x, each=2)
    
    mu <- function(x) { sin( 3*pi*x )/x }
    # generate some ys
    
    y <- mu(x) + rnorm(N, 0, .6)
    # Here's the data
    
    fake.data = data.frame(x=x, y=y)
  1. Plot the population.
    1. Plot the population and use the getmuFun function to overlay a function that averages the \(y\) values when the corresponding \(x\) values are the same.
    2. Plot the population and overlay the true mean function and the function in i).
    3. Plot the population and overlay a polynomial with degree 2 and 10. Use the getmuhat function.
  2. Generate \(m=100\) samples of size \(n=40\). Fit a polynomial with degree 2 and 10 to every sample.
    1. Using par(mfrow=c(1,2)) plot all the fitted polynomials with degree 2 and 10 on two different figures. Overlay the two fitted polynomials of degrees 2 and 10 based on the whole population.
    2. Using var_mutilde, calculate the sampling variability of the fitted polynomials with degrees equal to 2 and 10.
    3. Using bias2_mutilde, calculate the squared bias of the polynomials with degrees equal to 2 and 10.
  3. Generate \(m=100\) samples of size \(n=40\), and using apse_all, calculate the APSE for complexities equal to 0:10.
    • Summarize the results with a table and a graphical display.
    • Give a conclusion.

\(\;\)

  1. \(k\)-fold cross-validation. Using the sample
n=100
set.seed(341)
fake.sample = sample(nrow(fake.data), n)
sample.data = fake.data[fake.sample,]
  1. Plot the sample data and, using the getmuFun function, overlay a function that averages the \(y\) values when the corresponding \(x\) values are the same.
  2. Create a function that creates the \(k\)-fold samples from a given population, i.e.
    • The function has arguments k (the number of folds), pop (a population), xvarname (the name of the x variable), and yvarname (the name of the y variable).
    • The function outputs a list containing the \(k\)-fold training samples and test samples, labelled Ssamples and Tsamples.
    • The function rep_len might be helpful.
  3. Use the function from part b) and the apse function to find an estimate of the APSE using \(k=5\)-fold cross-validation when complexity=2.
  4. Perform \(k=5\)-fold cross-validation to estimate the complexity parameter from the set \(0:10\). Plot APSE by the complexity.
  5. Perform \(B=100\) bootstrap replications by resampling the errors to quantify the uncertainty in the complexity parameter. Construct a histogram of the complexity parameters.
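For part (e), "resampling the errors" means rebuilding responses from the fitted values plus resampled residuals and refitting. A minimal sketch of one bootstrap replicate on made-up data (the data and the degree used here are illustrative assumptions, not the fake data above):

```r
# Hedged sketch of the residual (error-resampling) bootstrap, one replicate.
set.seed(341)
dat <- data.frame(x = runif(50, 0, 1))
dat$y <- sin(3 * pi * dat$x) + rnorm(50, 0, 0.3)

fit   <- lm(y ~ poly(x, 5), data = dat)   # fit at some chosen complexity
y.hat <- fitted(fit)
res   <- residuals(fit)

# One bootstrap data set: same x's, responses = fitted values + resampled errors
boot.dat <- data.frame(x = dat$x,
                       y = y.hat + sample(res, replace = TRUE))
```

Repeating this \(B\) times and re-estimating the best complexity on each bootstrap data set gives the histogram asked for.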

\(\;\)

\(\;\)

\(\;\)

1.3 Fake Data Two

Bias Variance trade-offs: To explore the nature of the predictive accuracy of various polynomials, we construct a number of functions that allow us to generate data from the same response model and use these to assess the bias-variance trade-off of our estimators. In this question, you are going to do the same for the following model: \[ Y_i = 20\,\log\left(\frac{x^2+5\sin (\pi x)}{x-5}\right) + E_i~~,\quad E_i\sim_{iid} G(0,1) \]

set.seed(341)
N <- 125


# The x values
x <- runif(N, 7,8.5)
x = rep(x, each=2)

mu <- function(x) { 20*log((x^2+5*sin(pi*x))/(x-5)) }

# generate y values
y <- mu(x) + rnorm(N, 0, .2)

# And below is the simulated data
simulated.data = data.frame(x=x, y=y)
  1. Generate the scatter plot of the data and overlay fitted polynomials with degrees 2 and 10 to the data in different colours. Use the getmuhat function. Looking at this plot, which polynomial better fits the data?

  2. Generate \(m=100\) samples each of size \(n=40\). Fit polynomials of degree 2 and 10 to every sample. Using par(mfrow=c(1,2)) plot all 100 fitted polynomials with degree 2 and all of those with degree 10 on the two different panels. Overlay the two fitted polynomials of degrees 2 and 10 based on the whole population in a different colour so that they are visible.

  3. Using var_mutilde, calculate the sampling variability of the fitted polynomials with degrees equal to 2 and 10. Compare the two numbers.

  4. Using bias2_mutilde, calculate the squared bias of the polynomials with degrees equal to 2 and 10.

  5. Generate \(m=100\) samples each of size \(n=40\). Using the apse_all function, calculate the APSE for complexities equal to 0:10.
    • Summarize the results with a table and a graphical display.
    • Give a conclusion.

\(\;\)

  1. \(k\)-fold cross-validation: Consider the sample from the data defined below,
n=100
set.seed(341)
simulated.sample = sample(nrow(simulated.data), n)
sample.data = simulated.data[simulated.sample,]
  1. Plot the sample data, and using the getmuFun function, overlay a function that averages the \(y\) values when the corresponding \(x\) values are the same.
  2. Create a function sample.kfold that creates the \(k\)-fold samples from a given population, i.e.,
    • The function has arguments k: the number of folds, pop: a population, xvarname: the name of the \(x\) variable, and yvarname: the name of the \(y\) variable.
    • The function outputs a list containing the k-fold train and test samples labelled as Ssamples and Tsamples.
    • The function rep_len might be helpful.

Once you have the function, try

TEST = sample.kfold(k=5, pop=sample.data, "x", "y")
par(mfrow=c(1,2))
plot(TEST$Ssamples[[2]])
plot(TEST$Tsamples[[2]])
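As the hint suggests, rep_len can generate the fold labels, and shuffling them with sample assigns observations to folds at random. A small illustration of just this labelling step (not the required function itself):

```r
# Fold labels for k = 5 folds over N = 12 observations: rep_len recycles
# 1:5 to length 12, and sample() shuffles the assignment.
set.seed(341)
kset <- sample(rep_len(1:5, 12))
table(kset)   # each fold gets 2 or 3 observations
```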
  1. Use the function from part b) and the apse function to find an estimate of the APSE using \(k=5\)-fold cross-validation when complexity=3.

  2. Perform \(k=5\)-fold cross-validation to estimate the complexity parameter from the set \(0:10\). Plot APSE by the complexity.

  3. Perform \(B=100\) bootstrap replications by resampling the errors to quantify the uncertainty in the complexity parameter. Construct a histogram of the complexity parameters to visualize this uncertainty, and comment on the plot.

\(\;\)

\(\;\)

\(\;\)

1.4 Bootstrap for Robust Regression

Suppose that the Animals2 data is a sample and we are interested in a linear relationship between log(brain) and log(body).

library(MASS)
data(Animals2)

plot(log(Animals2$body), log(Animals2$brain), 
     pch=19, col=adjustcolor("grey", .5),
     xlab="log(body)", ylab="log(Brain)")
  1. Fit a linear function to the given sample using robust regression and the huber function.

  2. Generate \(B=1000\) bootstrap samples by resampling the pairs, and fit a linear function to each sample using robust regression and the huber function.

    1. Generate a scatterplot of the data along with all the robust regression lines using shaded lines.
    2. Obtain a 99% confidence interval (using the percentile method) for the robust regression line when \(x=-1\).
    3. Obtain a 99% confidence interval (using the percentile method) for the robust regression line when \(x=10\).
    4. Obtain a 99% confidence interval (using the percentile method) for the robust regression line.
  3. Repeat part b) by resampling the errors to generate the bootstrap samples.
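In R, Huber-type robust regression is available through MASS::rlm with psi = psi.huber (the huber function itself computes a robust location estimate); whether the course intends rlm or a hand-rolled use of huber is an assumption here. A sketch of the pairs bootstrap with a percentile interval at \(x=-1\), using the closely related Animals data from MASS as a stand-in and B = 200 to keep the run short:

```r
# Hedged sketch: pairs bootstrap for a Huber robust regression line, with a
# percentile confidence interval for the fitted value at x = -1.
library(MASS)

dat <- data.frame(x = log(Animals$body), y = log(Animals$brain))

fit <- rlm(y ~ x, data = dat, psi = psi.huber, maxit = 100)  # robust fit

set.seed(341)
B <- 200                                     # smaller than B = 1000 for speed
boot.pred <- replicate(B, {
  idx  <- sample(nrow(dat), replace = TRUE)  # resample the (x, y) pairs
  bfit <- rlm(y ~ x, data = dat[idx, ], psi = psi.huber, maxit = 100)
  sum(coef(bfit) * c(1, -1))                 # fitted line evaluated at x = -1
})
ci <- quantile(boot.pred, c(0.005, 0.995))   # 99% percentile interval
```

Resampling the errors instead (part c) would keep the x's fixed and rebuild y from the robust fitted values plus resampled residuals before refitting.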

\(\;\)

\(\;\)

\(\;\)

1.5 Weekly temperature data for Cairo

To explore the nature of the predictive accuracy of various polynomials, we will use some weekly average air temperatures in Cairo. Load the file cairo_temp.csv from LEARN.

cairo.temp = read.csv("cairo_temp.csv")
head(cairo.temp)
##   week temp
## 1    1 59.2
## 2    2 56.5
## 3    3 55.7
## 4    4 56.1
## 5    5 58.4
## 6    6 58.5
names(cairo.temp) = c('x', 'y')
  1. Generate the scatter plot of the data and overlay fitted polynomials with degrees 2 and 10 to the data.

  2. Generate \(m=50\) samples of size \(n=50\). Fit polynomials of degree 2 and 10 to every sample.

  3. Using par(mfrow=c(1,2)) plot all the fitted polynomials with degree 2 and 10 on two different figures. Overlay the two fitted polynomials of degree 2 and 10 based on the whole population (make the colour of the population curves different from the others to make them stand out).

  4. Using the var_mutilde function, calculate the sampling variability of the fitted polynomials with degrees equal to 2 and 10.

  5. Using the bias2_mutilde function, calculate the squared bias of the polynomials with degrees equal to 2 and 10.

  6. Generate \(m=50\) samples of size \(n=50\), and using the apse_all function, calculate the APSE for complexities equal to 0:10.
    • Summarize the results with a table and a graphical display.
    • Give a conclusion.
  7. Instead of randomly constructing sample and test sets, we can use \(k\)-fold cross-validation.

    1. Create a function that creates the \(k\)-fold samples from a given population, i.e.
      • The function has arguments k (the number of folds), pop (a population), xvarname (the name of the x variable), and yvarname (the name of the y variable).
      • The function outputs a list containing the \(k\)-fold training samples and test samples, labelled Ssamples and Tsamples.
      • The function rep_len might be helpful.
    2. Use the function from part i) and the apse function to find an estimate of the APSE using \(k=5\)-fold cross-validation when complexity=2.

    3. Perform \(k=10\)-fold cross-validation to estimate the complexity parameter from the set \(0:15\). Plot APSE by the complexity and give a conclusion.
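Once the APSE has been computed for each candidate complexity, the estimate is simply the minimizer. A tiny illustration with a made-up APSE curve (hypothetical numbers, not results for the Cairo data):

```r
# Picking the complexity that minimizes a (hypothetical) APSE curve.
complexities <- 0:15
apse <- (complexities - 6)^2 / 10 + 1   # pretend APSE, minimized at degree 6
best <- complexities[which.min(apse)]
best                                    # 6
```

which.min is preferable to comparing against min() with ==, since it always returns a single index even when there are ties.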

\(\;\)

\(\;\)

\(\;\)

2 Solutions

2.1 Prediction of Chicago Air Pollution

To explore the nature of the predictive accuracy of various polynomials, we will use a subset of the o3median attribute from the Chicago air pollution data in the R package gamair. \(O_3\), also known as ozone or trioxygen, is a strong oxidant with a chlorine-like odour that can cause respiratory damage if inhaled. Hence, \(O_3\) is an air pollutant when present where humans live.

library("gamair")
data(chicago)

n <- 500
x <- 1:n
o3.data <- data.frame(x=1:n, y=chicago$o3median[1:n])

head(chicago) # first few lines of the data set
##   death pm10median pm25median  o3median  so2median    time tmpd
## 1   130 -7.4335443         NA -19.59234  1.9280426 -2556.5 31.5
## 2   150         NA         NA -19.03861 -0.9855631 -2555.5 33.0
## 3   101 -0.8265306         NA -20.21734 -1.8914161 -2554.5 33.0
## 4   135  5.5664557         NA -19.67567  6.1393413 -2553.5 29.0
## 5   126         NA         NA -19.21734  2.2784649 -2552.5 32.0
## 6   130  6.5664557         NA -17.63400  9.8585839 -2551.5 40.0
  1. Generate the scatter plot of the data and overlay fitted polynomials with degrees 2 and 10 to the data.
muhat2 <- getmuhat(o3.data, 2)
muhat10 <- getmuhat(o3.data, 10)

xlim <- extendrange(o3.data$x)

plot(o3.data,
     pch=19, col= adjustcolor("black", 0.5))
curve(muhat2, from = xlim[1], to = xlim[2], 
      add = TRUE, col="red", lwd=2)
curve(muhat10, from = xlim[1], to = xlim[2], 
      add = TRUE, col="steelblue", lwd=2)
title(main="red=degree 2 , blue=degree 10")

  1. Generate \(m=100\) samples of size \(n=20\). Fit polynomials of degree 2 and 10 to every sample.
getSampleComp <- function(pop, size, replace=FALSE) {
  N <- dim(pop)[1]
  samp <- rep(FALSE, N)
  samp[sample(1:N, size, replace = replace)] <- TRUE
  samp
}


### This function will return a data frame containing
### only two variates, an x and a y
getXYSample <- function(xvarname, yvarname, samp, pop) {
  sampData <- pop[samp, c(xvarname, yvarname)]
  names(sampData) <- c("x", "y")
  sampData
}
N_S <- 100
set.seed(341)  # for reproducibility

n= 20
samps <- lapply(1:N_S, FUN= function(i){getSampleComp(o3.data, n)})
Ssamples <- lapply(samps, FUN= function(Si){getXYSample("x", "y", Si, o3.data)})
Tsamples <- lapply(samps, FUN= function(Si){getXYSample("x", "y", !Si, o3.data)})

muhats2 <- lapply(Ssamples, getmuhat, complexity = 2)
#getmubar(muhats2)

muhats10 <- lapply(Ssamples, getmuhat, complexity = 10)
#mubar10 <- getmubar(muhats10)
  1. Using par(mfrow=c(1,2)) plot all the fitted polynomials with degree 2 and 10 on two different figures. Overlay the two fitted polynomials of degree 2 and 10 based on the whole population (make the colour of the population curves different from the others to make them stand out).
par(mfrow=c(1,2))

xvals <- seq(xlim[1], xlim[2], length.out = 200)
plot(o3.data, 
     pch=19, type='n',
     xlab="x", ylab="predictions",
     main= " muhats (degree = 2) & mubar")


for (i in 1:N_S) {
  curveFn <- muhats2[[i]]
  curve(curveFn, from = xlim[1], to = xlim[2], add=TRUE, col=adjustcolor("blue", 0.2), lwd=3, lty=(1))
}

curve(muhat2,  from = xlim[1], to = xlim[2],
      add=TRUE, col="firebrick", lwd=3)

points(o3.data, 
     pch=19, col= adjustcolor("black", 0.5))


plot(o3.data, 
     pch=19, type='n',
     xlab="x", ylab="predictions",
     main= " muhats (degree = 10) & mubar")

for (i in 1:N_S) {
  curveFn <- muhats10[[i]]
  curve(curveFn, xlim[1], xlim[2], add=TRUE, col=adjustcolor("blue", 0.2), lwd=3, lty=1)
}

curve(muhat10, xlim[1], xlim[2], add=TRUE, col="firebrick", lwd=3)

points(o3.data, 
     pch=19, col= adjustcolor("black", 0.5))

  1. Using the var_mutilde function, calculate the sampling variability of the fitted polynomials with degrees equal to 2 and 10.
getmubar <- function(muhats) {
  function(x) {
    Ans <- sapply(muhats, FUN=function(muhat){muhat(x)})
    apply(Ans, MARGIN=1, FUN=mean)
  }
}
ave_y_mu_sq <- function(sample, predfun, na.rm = TRUE){
  mean((sample$y - predfun(sample$x))^2, na.rm = na.rm)
}
 
###########

ave_mu_mu_sq <- function(predfun1, predfun2, x, na.rm = TRUE){
  mean((predfun1(x) - predfun2(x))^2, na.rm = na.rm)
}

###########


var_mutilde <- function(Ssamples, Tsamples, complexity){
  ## get the predictor function for every sample S
  muhats <- lapply(Ssamples, 
                   FUN=function(sample){
                     getmuhat(sample, complexity)
                   }
  )
  ## get the average of these, mubar
  mubar <- getmubar(muhats)
  
  ## average over all samples S
  N_S <- length(Ssamples)
  mean(sapply(1:N_S, 
              FUN=function(j){
                ## get muhat based on sample S_j
                muhat <- muhats[[j]]
                ## average over x_i in a
                ## single test sample T_j the squares
                ## (muhat(x) - mubar(x))^2
                T_j <- Tsamples[[j]]
                ave_mu_mu_sq(muhat, mubar, T_j$x)
              }
  )
  )
}
var_mutilde(Ssamples, Tsamples, complexity=2)
## [1] 39.4854
var_mutilde(Ssamples, Tsamples, complexity=10)
## [1] 5313019
  • The sampling variability is much higher for degree 10 polynomials. This is an indication of overfitting.
  1. Using the bias2_mutilde function, calculate the squared bias of the polynomials with degrees equal to 2 and 10.
bias2_mutilde <- function(Ssamples, Tsamples, mu, complexity){
  ## get the predictor function for every sample S
  muhats <- lapply(Ssamples, 
                   FUN=function(sample) getmuhat(sample, complexity)
  )
  ## get the average of these, mubar
  mubar <- getmubar(muhats)
  
  ## average over all samples S
  N_S <- length(Ssamples)
  mean(sapply(1:N_S, 
              FUN=function(j){
                ## average over x_i in a
                ## single test sample T_j the squares
                ## (mubar(x) - mu(x))^2
                T_j <- Tsamples[[j]]
                ave_mu_mu_sq(mubar, mu, T_j$x)
              }
  )
  )
}
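getmuFun, used just below, is another course helper that is not defined in this document. A plausible sketch, assuming it averages the \(y\) values at tied \(x\) values and interpolates linearly between the distinct \(x\)'s, extending flat beyond the data range; this is a guess at its behaviour, not the official implementation:

```r
# Hedged sketch of getmuFun: average y at each distinct x, then connect the
# averages by linear interpolation (rule = 2 extends constant at the ends).
getmuFun <- function(pop, xvarname, yvarname) {
  ybar <- tapply(pop[[yvarname]], pop[[xvarname]], mean, na.rm = TRUE)
  approxfun(as.numeric(names(ybar)), as.numeric(ybar), rule = 2)
}
```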
muhat = getmuFun(o3.data, "x", 'y')

bias2_mutilde(Ssamples, Tsamples, muhat, complexity=2)
## [1] 105.7658
bias2_mutilde(Ssamples, Tsamples, muhat, complexity=10)
## [1] 14093.58
  1. Generate \(m=100\) samples of size \(n=40\), and using the apse_all function, calculate the APSE for complexities equal to 0:10.
    • Summarize the results with a table and a graphical display.
    • Give a conclusion.
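apse_all is used below but never defined in this document. Here is a minimal, self-contained sketch of what such a function could look like, consistent with the four components printed in the output (apse, var_mutilde, bias2, var_y); the stand-in getmuhat and getmubar are simplified assumptions, not the course implementations:

```r
# Hedged sketch of apse_all: average predicted squared error and its
# decomposition, averaged over the sample/test pairs (S_j, T_j).
getmuhat <- function(s, complexity) {          # stand-in polynomial fitter
  fit <- if (complexity == 0) lm(y ~ 1, data = s)
         else lm(y ~ poly(x, complexity), data = s)
  function(x) predict(fit, newdata = data.frame(x = x))
}
getmubar <- function(muhats) {                 # pointwise average of the fits
  function(x) rowMeans(sapply(muhats, function(mh) mh(x)))
}
apse_all <- function(Ssamples, Tsamples, complexity, mu) {
  muhats <- lapply(Ssamples, getmuhat, complexity = complexity)
  mubar  <- getmubar(muhats)
  rowMeans(sapply(seq_along(Ssamples), function(j) {
    T_j <- Tsamples[[j]]
    mh  <- muhats[[j]]
    c(apse        = mean((T_j$y - mh(T_j$x))^2),        # total prediction error
      var_mutilde = mean((mh(T_j$x) - mubar(T_j$x))^2), # sampling variability
      bias2       = mean((mubar(T_j$x) - mu(T_j$x))^2), # squared bias
      var_y       = mean((T_j$y - mu(T_j$x))^2))        # irreducible noise
  }))
}
```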
complexities <- 0:10

apse_vals <-  sapply(complexities, 
                     FUN = function(complexity){
                       apse_all(Ssamples, Tsamples, 
                                complexity = complexity, mu = muhat)
                     }
)

# Print out the results
t(rbind(complexities, apse=round(apse_vals,5)))
##       complexities         apse  var_mutilde       bias2    var_y
##  [1,]            0 1.419895e+02 5.504920e+00   109.74258 41.35165
##  [2,]            1 1.497814e+02 1.423684e+01   108.20061 41.35165
##  [3,]            2 1.735070e+02 3.948540e+01   105.76581 41.35165
##  [4,]            3 9.734785e+01 2.377330e+01    46.07064 41.35165
##  [5,]            4 1.521383e+02 7.382526e+01    50.00529 41.35165
##  [6,]            5 3.701323e+02 2.983637e+02    43.44568 41.35165
##  [7,]            6 8.089953e+03 7.855074e+03   192.62305 41.35165
##  [8,]            7 1.921738e+04 1.896188e+04   211.56790 41.35165
##  [9,]            8 2.310388e+05 2.280033e+05  2754.64124 41.35165
## [10,]            9 7.649464e+04 7.421728e+04  2059.55748 41.35165
## [11,]           10 5.328469e+06 5.313019e+06 14093.57714 41.35165
plot( complexities, apse_vals[1,], xlab="Degree", ylab="", type='l', ylim=c(0, 6e+06), col="purple", lwd=2 )
lines(complexities, apse_vals[2,], col="blue", lwd=2 )
lines(complexities, apse_vals[3,], col="red", lwd=2)
lines(complexities, apse_vals[4,], col="black", lwd=2)

# The increase in apse is too sharp in higher complexities. Let's zoom in a bit
zoom = 1:6  # column indices for degrees 0 to 5 (an index of 0 would just be dropped)
plot( complexities[zoom], apse_vals[1, zoom], xlab="Degree", ylab="", type='l', ylim=c(0, 400), col="purple", lwd=2 )
lines(complexities[zoom], apse_vals[2, zoom], col="blue", lwd=2 )
lines(complexities[zoom], apse_vals[3, zoom], col="red", lwd=2)
lines(complexities[zoom], apse_vals[4, zoom], col="black", lwd=2)

  • The polynomial with degree 3 has the lowest APSE.
  • The variance of the predictor function shows a high margin of increase for higher degrees, which indicates overfitting as shown earlier.
  • The bias starts off higher than the variance of the predictor function, but is overtaken by it from degree 4 onward. Its relatively slower increase is expected given the much faster growth of the variance.
muhat3 <- getmuhat(o3.data, 3)

xlim <- extendrange(o3.data$x)

plot(o3.data,
     pch=19, col= adjustcolor("black", 0.5),
     main="Best fit (degree 3)")
curve(muhat3, from = xlim[1], to = xlim[2], 
      add = TRUE, col="red", lwd=2)

  • The visual assessment further confirms a decent fit for the degree 3 polynomial.
  • This fit is quite different from the degree 10 polynomial outside the range of the sample.

\(\;\)

\(\;\)

\(\;\)

2.2 Fake Data One

Bias-variance trade-offs: To explore the nature of the predictive accuracy of various polynomials, we construct a number of functions that allow us to generate data from the same response model and use these to assess the bias-variance trade-off of our estimators.

In this question, you are going to do the same, but now using the same mechanism used to generate the fake data used in the notes for illustrating different fits.

  • The fake data was constructed as follows:

    # Get some x's
    set.seed(341)
    N <- 125
    x <- runif(N, 0, 1)
    x = rep(x, each=2)
    
    mu <- function(x) { sin( 3*pi*x )/x }
    # generate some ys
    
    y <- mu(x) + rnorm(N, 0, .6)
    # Here's the data
    
    fake.data = data.frame(x=x, y=y)
  1. Plot the population.
temp = getmuFun(fake.data, "x", 'y')
  1. Plot the population and use the getmuFun function to overlay a function that averages the \(y\) values when the corresponding \(x\) values are the same.

  2. Plot the population and overlay the true mean function and the function in i).

plot(x, y, 
     col=adjustcolor("firebrick", 0.4), 
     pch=19, cex=0.5,
     main = "The fake data with mu(x) = sin(3*pi*x)/x",  xlab="x", ylab="response" )

muhat = getmuFun(fake.data, "x", 'y')

# overlay the mean function
xrange <- extendrange(x)
newx <- seq(min(xrange), max(xrange), length.out = 500)

lines(newx, temp(newx), col=1, type='l')
lines(newx, mu(newx), type='l', col=4)

  1. Plot the population and overlay a polynomial with degree 2 and 10. Use the getmuhat function.
### The function getSampleComp will return a logical vector
### (that way the complement is also recorded in the sample)
getSampleComp <- function(pop, size, replace=FALSE) {
  N <- nrow(pop)
  samp <- rep(FALSE, N)
  samp[sample(1:N, size, replace = replace)] <- TRUE
  samp
}

### This function will return a data frame containing
### only two variates, an x and a y
getXYSample <- function(xvarname, yvarname, samp, pop) {
  sampData <- pop[samp, c(xvarname, yvarname)]
  names(sampData) <- c("x", "y")
  sampData
}
xvarname <- "x"
yvarname <- "y"

pop <- data.frame(y=y,x=x)

### Construct the sample
samp <- getSampleComp(pop, 50)
sampData <- getXYSample(xvarname, yvarname, samp, pop)

### Set the degree of the polynomial
complexity <- 2

muhat2 <- getmuhat(sampData, 2)
muhat10 <- getmuhat(sampData, 10)

xlim <- extendrange(pop[, xvarname])

plot(sampData, 
     main="muhats (degrees 2 and 10) and the sample",
     xlab = xvarname, ylab = yvarname,
     pch=19, col= adjustcolor("black", 0.5))
curve(muhat2, from = xlim[1], to = xlim[2], 
      add = TRUE, col="steelblue", lwd=2)
curve(muhat10, from = xlim[1], to = xlim[2], 
      add = TRUE, col="firebrick", lwd=2)

  1. Generate \(m=100\) samples of size \(n=40\). Fit a polynomial with degree 2 and 10 to every sample.
N_S <- 100
set.seed(341)  # for reproducibility

n= 40
samps <- lapply(1:N_S, FUN= function(i){getSampleComp(pop, n)})
Ssamples <- lapply(samps, FUN= function(Si){getXYSample("x", "y", Si, pop)})
Tsamples <- lapply(samps, FUN= function(Si){getXYSample("x", "y", !Si, pop)})

muhats2 <- lapply(Ssamples, getmuhat, complexity = 2)
#getmubar(muhats2)

muhats10 <- lapply(Ssamples, getmuhat, complexity = 10)
#mubar10 <- getmubar(muhats10)
  1. Using par(mfrow=c(1,2)) plot all the fitted polynomials with degree 2 and 10 on two different figures. Overlay the two fitted polynomials of degrees 2 and 10 based on the whole population.
par(mfrow=c(1,2))

xvals <- seq(xlim[1], xlim[2], length.out = 200)
plot(pop[,c(xvarname, yvarname)], 
     pch=19, col= adjustcolor("black", 0.5),
     xlab="x", ylab="predictions",
     main= paste0(N_S, " muhats (degree = ", complexity, ") and mubar")
)

for (i in 1:N_S) {
  curveFn <- muhats2[[i]]
  curve(curveFn, from = xlim[1], to = xlim[2], add=TRUE, col=adjustcolor("blue", 0.2), lwd=3, lty=1)
}

curve(muhat2,  from = xlim[1], to = xlim[2],
      add=TRUE, col="firebrick", lwd=3)


plot(pop[,c(xvarname, yvarname)], 
     pch=19, col= adjustcolor("black", 0.5),
     xlab="x", ylab="predictions",
     main= paste0(N_S, " muhats (degree = 10) and mubar")
)

for (i in 1:N_S) {
  curveFn <- muhats10[[i]]
  curve(curveFn, xlim[1], xlim[2], add=TRUE, col=adjustcolor("blue", 0.2), lwd=3, lty=1)
}

curve(muhat10, xlim[1], xlim[2], add=TRUE, col="firebrick", lwd=3)

  1. Using var_mutilde, calculate the sampling variability of the fitted polynomials with degrees equal to 2 and 10.
var_mutilde(Ssamples, Tsamples, complexity=2)
## [1] 0.1288159
var_mutilde(Ssamples, Tsamples, complexity=10)
## [1] 3.694942
  1. Using bias2_mutilde, calculate the squared bias of the polynomials with degrees equal to 2 and 10.
bias2_mutilde(Ssamples, Tsamples, muhat, complexity=2)
## [1] 1.357051
bias2_mutilde(Ssamples, Tsamples, muhat, complexity=10)
## [1] 0.182035
  1. Generate \(m=100\) samples of size \(n=40\), and using apse_all, calculate the APSE for complexities equal to 0:10.
    • Summarize the results with a table and a graphical display.
    • Give a conclusion.
complexities <- 0:10

apse_vals <-  sapply(complexities, 
                     FUN = function(complexity){
                       apse_all(Ssamples, Tsamples, 
                                complexity = complexity, mu = muhat)
                     }
)

# Print out the results
t(rbind(complexities, apse=round(apse_vals,5)))
##       complexities     apse var_mutilde    bias2   var_y
##  [1,]            0 15.00455     0.28249 14.12932 0.51809
##  [2,]            1  7.60287     0.29261  6.80066 0.51809
##  [3,]            2  1.81655     0.12882  1.35705 0.51809
##  [4,]            3  1.44675     0.14690  0.99523 0.51809
##  [5,]            4  0.58793     0.08385  0.23916 0.51809
##  [6,]            5  0.43442     0.05318  0.14097 0.51809
##  [7,]            6  0.47345     0.08851  0.14385 0.51809
##  [8,]            7  0.54486     0.15736  0.13687 0.51809
##  [9,]            8  0.91761     0.51795  0.14446 0.51809
## [10,]            9  0.82807     0.44424  0.12669 0.51809
## [11,]           10  4.14580     3.69494  0.18204 0.51809
plot( complexities, apse_vals[3,], xlab="Degree", ylab="", type='l', ylim=c(0, 2), col="firebrick", lwd=2 )
lines(complexities, apse_vals[2,], xlab="Degree", ylab="", col="steelblue", lwd=2 )
lines(complexities, apse_vals[1,], col="purple", lwd=2)

  • The polynomial with degree 5 has the lowest APSE.

\(\;\)

\(\;\)

\(\;\)

  1. \(k\)-fold cross-validation. Using the sample
n=100
set.seed(341)
fake.sample = sample(nrow(fake.data), n)
sample.data = fake.data[fake.sample,]
  1. Plot the sample data and, using the getmuFun function, overlay a function that averages the \(y\) values when the corresponding \(x\) values are the same.
plot(sample.data$x, sample.data$y, 
     col=adjustcolor("firebrick", 0.4), 
     pch=19, cex=0.5,
     main = "The fake data with mu(x) = sin(3*pi*x)/x",  xlab="x", ylab="response" )

sample.muFun = getmuFun(sample.data, "x", 'y')
curve(sample.muFun, from = -1, to = 1.2, add=TRUE)

  1. Create a function that creates the \(k\)-fold samples from a given population, i.e.
    • The function has arguments k (the number of folds), pop (a population), xvarname (the name of the x variable), and yvarname (the name of the y variable).
    • The function outputs a list containing the \(k\)-fold training samples and test samples, labelled Ssamples and Tsamples.
    • The function rep_len might be helpful.
sample.kfold <- function(k=NULL, pop=NULL, xvarname=NULL, yvarname=NULL ) {
  
  N = nrow(pop)
  kset = rep_len(1:k, N)
  kset = sample(kset)
  
  samps = list()
  for (i in 1:k) { 
    samps[[i]] = logical(N)
    samps[[i]][kset != i] = TRUE
  }
  
Ssamples <- lapply(samps, 
                   FUN= function(Si){getXYSample(xvarname, yvarname, Si, pop)})

Tsamples <- lapply(samps, 
                   FUN= function(Si){getXYSample(xvarname, yvarname, !Si, pop)})

    list(Ssamples=Ssamples,Tsamples=Tsamples)
}
  1. Use the function from part b) and the apse function to find an estimate of the APSE using \(k=5\)-fold cross-validation when complexity=2.
kfold.samples = sample.kfold(k=5, pop=sample.data, "x", "y")

apse_all(kfold.samples$Ssamples, kfold.samples$Tsamples, complexity = 2, mu = sample.muFun)
##        apse var_mutilde       bias2       var_y 
##  1.56981603  0.01024957  1.23462564  0.69410576
  1. Perform \(k=5\)-fold cross-validation to estimate the complexity parameter from the set \(0:10\). Plot APSE by the complexity.
complexities <- 0:10

apse_vals <-sapply(complexities, 
      FUN = function(complexity){
          apse_all(kfold.samples$Ssamples, kfold.samples$Tsamples, 
              complexity = complexity, mu = sample.muFun) })

# Print out the results
t(rbind(complexities, apse=round(apse_vals,5)))
##       complexities     apse var_mutilde    bias2   var_y
##  [1,]            0 17.60165     0.08744 16.26245 0.69411
##  [2,]            1  7.47448     0.07542  6.64126 0.69411
##  [3,]            2  1.56982     0.01025  1.23463 0.69411
##  [4,]            3  1.38883     0.01035  1.08309 0.69411
##  [5,]            4  0.49961     0.00665  0.25935 0.69411
##  [6,]            5  0.44107     0.00786  0.21582 0.69411
##  [7,]            6  0.43991     0.00855  0.20605 0.69411
##  [8,]            7  0.44438     0.01418  0.20408 0.69411
##  [9,]            8  0.47047     0.02907  0.20577 0.69411
## [10,]            9  0.43243     0.01430  0.20181 0.69411
## [11,]           10  0.44841     0.02037  0.19582 0.69411
complexities[ apse_vals[2,] == min(apse_vals[2,])  ]
## [1] 4
plot( complexities, apse_vals[3,], xlab="Degree", ylab="", type='l', ylim=c(0, 2), col="firebrick", lwd=2 )
lines(complexities, apse_vals[2,], xlab="Degree", ylab="", col="steelblue", lwd=2 )
lines(complexities, apse_vals[1,], col="purple", lwd=2)

  • We pick a degree-five polynomial, the point where the APSE curve levels off.
  1. Perform \(B=100\) bootstrap replications by resampling the errors to quantify the uncertainty in the complexity parameter. Construct a histogram of the complexity parameters.
muhats5 <- getmuhat( sample.data, complexity = 5)

y.hat = muhats5(sample.data$x)
R = sample.data$y - y.hat

B = 100
n = nrow(sample.data)
complexities = 0:10
z = numeric(B)
boot.data = sample.data

set.seed(341)
for (i in 1:B) {
 #boot.data = sample.data[sample(n, n, replace=TRUE),]
 boot.data$y = y.hat + R[sample(n, n, replace=TRUE)]
 boot.muFun = getmuFun(boot.data, "x", 'y')
 kfold.boot = sample.kfold(k=5, pop=boot.data, "x", "y")
 
 apse.boot <- sapply(complexities, 
      FUN = function(complexity){
          apse_all(kfold.boot$Ssamples, kfold.boot$Tsamples, 
              complexity = complexity, mu = boot.muFun) })
 z[i] = complexities[ which.min(apse.boot[2,]) ]
}
hist(z, breaks=0:12 -1/2, xlab="Model Complexity", main="Histogram of Model Complexity", prob=TRUE)

\(\;\)

\(\;\)

\(\;\)

\(\;\)

2.3 Fake Data Two

Bias Variance trade-offs: To explore the nature of the predictive accuracy of various polynomials, we construct a number of functions that allow us to generate data from the same response model and use these to assess the bias-variance trade-off of our estimators. In this question, you are going to do the same for the following model: \[ Y_i = 20\,\log\left(\frac{x^2+5\sin (\pi x)}{x-5}\right) + E_i~~,\quad E_i\sim_{iid} G(0,1) \]

set.seed(341)
N <- 125


# The x values
x <- runif(N, 7,8.5)
x = rep(x, each=2)

mu <- function(x) { 20*log((x^2+5*sin(pi*x))/(x-5)) }

# generate y values
y <- mu(x) + rnorm(N, 0, .2)

# And below is the simulated data
simulated.data = data.frame(x=x, y=y)
  1. Generate the scatter plot of the data and overlay fitted polynomials with degrees 2 and 10 to the data in different colours. Use the getmuhat function. Looking at this plot, which polynomial better fits the data?

Solution:

muhat2 <- getmuhat(simulated.data, 2)
muhat10 <- getmuhat(simulated.data, 10)

xlim <- extendrange(simulated.data[,1])

plot(simulated.data,
     pch=19, col= adjustcolor("black", 0.5))
curve(muhat2, from = xlim[1], to = xlim[2], 
      add = TRUE, col="red", lwd=2)
curve(muhat10, from = xlim[1], to = xlim[2], 
      add = TRUE, col="steelblue", lwd=2)
title(main="red=degree 2 , blue=degree 10")

Clearly, the degree-10 polynomial fits better, but the question is: would a degree smaller than 10 (but larger than 2) fit just as well?

  1. Generate \(m=100\) samples each of size \(n=40\). Fit polynomials of degree 2 and 10 to every sample. Using par(mfrow=c(1,2)) plot all 100 fitted polynomials of degree 2 and all of those of degree 10 in two different panels. Overlay the two fitted polynomials of degree 2 and 10 based on the whole population in a different colour so that they are visible.

Solution:

getSampleComp <- function(pop, size, replace=FALSE) {
  N <- dim(pop)[1]
  samp <- rep(FALSE, N)
  samp[sample(1:N, size, replace = replace)] <- TRUE
  samp
}


### This function will return a data frame containing
### only two variates, an x and a y
getXYSample <- function(xvarname, yvarname, samp, pop) {
  sampData <- pop[samp, c(xvarname, yvarname)]
  names(sampData) <- c("x", "y")
  sampData
}
N_S <- 100
set.seed(341)  # for reproducibility

n= 40
samps <- lapply(1:N_S, FUN= function(i){getSampleComp(simulated.data, n)})
Ssamples <- lapply(samps, FUN= function(Si){getXYSample("x", "y", Si, simulated.data)})
Tsamples <- lapply(samps, FUN= function(Si){getXYSample("x", "y", !Si, simulated.data)})

muhats2 <- lapply(Ssamples, getmuhat, complexity = 2)
#getmubar(muhats2)

muhats10 <- lapply(Ssamples, getmuhat, complexity = 10)
#mubar10 <- getmubar(muhats10)
par(mfrow=c(1,2))

xvals <- seq(xlim[1], xlim[2], length.out = 200)
plot(simulated.data, 
     pch=19, type='n',
     xlab="x", ylab="predictions",
     main= " muhats (degree = 2) & mubar")


for (i in 1:N_S) {
  curveFn <- muhats2[[i]]
  curve(curveFn, from = xlim[1], to = xlim[2], add=TRUE, col="blue", lwd=3, lty=(1))
}

curve(muhat2,  from = xlim[1], to = xlim[2],
      add=TRUE, col="firebrick", lwd=3)

points(simulated.data, 
     pch=19, col= adjustcolor("black", 0.5))


plot(simulated.data, 
     pch=19, type='n',
     xlab="x", ylab="predictions",
     main= " muhats (degree = 10) & mubar")

for (i in 1:N_S) {
  curveFn <- muhats10[[i]]
  curve(curveFn, xlim[1], xlim[2], add=TRUE, col="blue", lwd=3, lty=1)
}

curve(muhat10, xlim[1], xlim[2], add=TRUE, col="firebrick", lwd=3)

points(simulated.data, 
     pch=19, col= adjustcolor("black", 0.5))

  1. Using var_mutilde, calculate the sampling variability of the fitted polynomials with degree equal to 2 and 10. Compare the two numbers.

Solution:

getmubar <- function(muhats) {
  function(x) {
    Ans <- sapply(muhats, FUN=function(muhat){muhat(x)})
    apply(Ans, MARGIN=1, FUN=mean)
  }
}
ave_y_mu_sq <- function(sample, predfun, na.rm = TRUE){
  mean((sample$y - predfun(sample$x))^2, na.rm = na.rm)
}
 
###########

ave_mu_mu_sq <- function(predfun1, predfun2, x, na.rm = TRUE){
  mean((predfun1(x) - predfun2(x))^2, na.rm = na.rm)
}

###########

var_mutilde <- function(Ssamples, Tsamples, complexity){
  ## get the predictor function for every sample S
  muhats <- lapply(Ssamples, 
                   FUN=function(sample){
                     getmuhat(sample, complexity)
                   }
  )
  ## get the average of these, mubar
  mubar <- getmubar(muhats)
  
  ## average over all samples S
  N_S <- length(Ssamples)
  mean(sapply(1:N_S, 
              FUN=function(j){
                ## get muhat based on sample S_j
                muhat <- muhats[[j]]
                ## average over x_i in a single
                ## sample T_j the squares
                ## (muhat(x) - mubar(x))^2
                T_j <- Tsamples[[j]]
                ave_mu_mu_sq(muhat, mubar, T_j$x)
              }
  )
  )
}
var_mutilde(Ssamples, Tsamples, complexity=2)
## [1] 0.01457752
var_mutilde(Ssamples, Tsamples, complexity=10)
## [1] 0.0124782

There is no clear difference in variability between the two degrees, though the degree-10 polynomials show slightly lower variability here.

  1. Using bias2_mutilde calculate the squared bias of the polynomials with degree equal to 2 and 10.

Solution:

bias2_mutilde <- function(Ssamples, Tsamples, mu, complexity){
  ## get the predictor function for every sample S
  muhats <- lapply(Ssamples, 
                   FUN=function(sample) getmuhat(sample, complexity)
  )
  ## get the average of these, mubar
  mubar <- getmubar(muhats)
  
  ## average over all samples S
  N_S <- length(Ssamples)
  mean(sapply(1:N_S, 
              FUN=function(j){
                ## average over x_i in a single
                ## sample T_j the squares
                ## (mubar(x) - mu(x))^2
                T_j <- Tsamples[[j]]
                ave_mu_mu_sq(mubar, mu, T_j$x)
              }
  )
  )
}
muhat = getmuFun(simulated.data, 'x', 'y')

bias2_mutilde(Ssamples, Tsamples, muhat,complexity=2)
## [1] 0.1644139
bias2_mutilde(Ssamples, Tsamples, muhat, complexity=10)
## [1] 0.01871003

There is a clear difference (almost an order of magnitude) between the squared biases. Of course, the more complex model (degree 10) has a lower bias.
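The "almost an order of magnitude" claim is easy to verify from the two values printed above:

```r
# squared-bias estimates printed above for degree 2 and degree 10
bias2.deg2  <- 0.1644139
bias2.deg10 <- 0.01871003
bias2.deg2 / bias2.deg10  # roughly 8.8, close to a tenfold difference
```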

  1. Generate \(m=100\) samples each of size \(n=40\). Using the apse_all function, calculate the APSE for complexities equal to 0:10.
    • Summarize the results with a table and a graphical display.
    • Give a conclusion.

Solution:

apse_all <- function(Ssamples, Tsamples, complexity, mu){
  ## average over the samples S
  ##
  N_S <- length(Ssamples)
  muhats <- lapply(Ssamples, 
                   FUN=function(sample) getmuhat(sample, complexity)
  )
  ## get the average of these, mubar
  mubar <- getmubar(muhats)
  
  rowMeans(sapply(1:N_S, 
                  FUN=function(j){
                    T_j <- Tsamples[[j]]
                    muhat <- muhats[[j]]
                    ## Take care of any NAs
                    T_j <- na.omit(T_j)
                    y <- T_j$y
                    x <- T_j$x
                    mu_x <- mu(x)
                    muhat_x <- muhat(x)
                    mubar_x <- mubar(x)
                    
                    ## apse:
                    ## average over (x_i,y_i) in a
                    ## single sample T_j the squares
                    ## (y - muhat(x))^2
                    apse <- (y - muhat_x)
                    
                    ## bias2:
                    ## average over x_i in a single
                    ## sample T_j the squares
                    ## (mubar(x) - mu(x))^2
                    bias2 <- (mubar_x - mu_x)
                    
                    ## var_mutilde:
                    ## average over x_i in a single
                    ## sample T_j the squares
                    ## (muhat(x) - mubar(x))^2
                    var_mutilde <- (muhat_x - mubar_x)
                    
                    ## var_y:
                    ## average over (x_i,y_i) in a
                    ## single sample T_j the squares
                    ## (y - mu(x))^2
                    var_y <- (y - mu_x)
                    
                    ## Put them together and square them
                    squares <- rbind(apse, var_mutilde, bias2, var_y)^2
                    
                    ## return means
                    rowMeans(squares)
                  }
  ))
}
complexities <- 0:10

apse_vals <-  sapply(complexities, 
                     FUN = function(complexity){
                       apse_all(Ssamples, Tsamples, 
                                complexity = complexity, mu = muhat)
                     }
)

# Print out the results
t(rbind(complexities, apse=round(apse_vals,5)))
##       complexities    apse var_mutilde   bias2   var_y
##  [1,]            0 0.97581     0.01950 0.92848 0.01972
##  [2,]            1 0.91632     0.03872 0.84349 0.01972
##  [3,]            2 0.20308     0.01458 0.16441 0.01972
##  [4,]            3 0.04853     0.00488 0.02295 0.01972
##  [5,]            4 0.04826     0.00573 0.02141 0.01972
##  [6,]            5 0.04806     0.00636 0.02020 0.01972
##  [7,]            6 0.04976     0.00772 0.02018 0.01972
##  [8,]            7 0.05052     0.00889 0.01945 0.01972
##  [9,]            8 0.05151     0.00990 0.01912 0.01972
## [10,]            9 0.05289     0.01123 0.01892 0.01972
## [11,]           10 0.05429     0.01248 0.01871 0.01972
# Plotting the results
plot( complexities, apse_vals[1,], xlab="Degree", ylab="", type='l',
      ylim=c(0, 1), col="purple", lwd=2 )
lines(complexities, apse_vals[2,], col="blue", lwd=2 )
lines(complexities, apse_vals[3,], col="red", lwd=2)
lines(complexities, apse_vals[4,], col="black", lwd=2)

indx = which.min(t(apse_vals)[,'apse'])
complexities[indx]
## [1] 5

The polynomial with degree 5 has the lowest APSE.
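As a sanity check, the APSE column should approximately equal var_mutilde + bias2 + var_y, up to cross-product terms that average towards zero. Checking the degree-5 row of the table above:

```r
# degree-5 row of the APSE table above
apse <- 0.04806; var_mutilde <- 0.00636; bias2 <- 0.02020; var_y <- 0.01972
# the three components nearly add up to the APSE;
# the small remainder comes from the averaged cross-product terms
apse - (var_mutilde + bias2 + var_y)  # 0.00178
```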

\(\;\)

  1. \(k\)-fold cross-validation: Consider the sample from the data defined below,
n=100
set.seed(341)
simulated.sample = sample(nrow(simulated.data), n)
sample.data = simulated.data[simulated.sample,]
  1. Plot the sample data and, using the getmuFun function, overlay the function that returns the average \(y\) value at each distinct \(x\) value.

Solution:

plot(sample.data$x, sample.data$y, 
     col=adjustcolor("firebrick", 0.4), 
     pch=19, cex=0.5,
     main = "The Sample Data",  xlab="x", ylab="response" )

sample.muFun = getmuFun(sample.data, "x", 'y')

curve(sample.muFun, from = -6, to =15, add=TRUE)

  1. Create a function sample.kfold that creates the \(k\)-fold samples from a given population, i.e.,
    • The function has arguments k: the number of folds, pop: a population, xvarname: the name of the \(x\) variable, and yvarname: the name of the \(y\) variable.
    • The function outputs a list containing the k-fold train and test samples labelled as Ssamples and Tsamples.
    • The function rep_len might be helpful.
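To see why rep_len is helpful here: it recycles the fold labels 1:k over all N rows, and shuffling the result with sample() then assigns each row to a random fold of near-equal size. A minimal illustration:

```r
kset <- rep_len(1:5, 12)  # recycle the fold labels over 12 observations
kset                      # 1 2 3 4 5 1 2 3 4 5 1 2
table(sample(kset))       # after shuffling, fold sizes still differ by at most 1
```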

Once you have the function, try

TEST = sample.kfold(k=5, pop=sample.data, "x", "y")
par(mfrow=c(1,2))
plot(TEST$Ssamples[[2]])
plot(TEST$Tsamples[[2]])

Solution:

sample.kfold <- function(k=NULL, pop=NULL, xvarname=NULL, yvarname=NULL ) {
  
  N = nrow(pop)
  kset = rep_len(1:k, N)
  kset = sample(kset)
  
  samps = list()
  for (i in 1:k) { 
    samps[[i]] = logical(N)
    samps[[i]][kset != i] = TRUE
  }
  
Ssamples <- lapply(samps, 
                   FUN= function(Si){getXYSample(xvarname, yvarname, Si, pop)})

Tsamples <- lapply(samps, 
                   FUN= function(Si){getXYSample(xvarname, yvarname, !Si, pop)})
                   
    list(Ssamples=Ssamples,Tsamples=Tsamples)
}

TEST = sample.kfold(k=5, pop=sample.data, "x", "y")
par(mfrow=c(1,2))
plot(TEST$Ssamples[[2]])
plot(TEST$Tsamples[[2]])

  1. Use the function from part b) and the apse function to find an estimate of the APSE using \(k=5\) fold cross-validation when complexity=3.

Solution:

kfold.samples = sample.kfold(k=5, pop=sample.data, "x", "y")

apse_all(kfold.samples$Ssamples, kfold.samples$Tsamples, complexity = 3, mu = sample.muFun)
##         apse  var_mutilde        bias2        var_y 
## 0.0482549336 0.0003269455 0.0346875589 0.0110752583
  1. Perform \(k=5\)-fold cross-validation to estimate the complexity parameter from the set \(0:10\). Plot APSE by the complexity.

Solution:

complexities <- 0:10

apse_vals <-sapply(complexities, 
      FUN = function(complexity){
          apse_all(kfold.samples$Ssamples, kfold.samples$Tsamples, 
              complexity = complexity, mu = sample.muFun) })

# Print out the results
apse.Analysis = t(rbind(complexities, apse=round(apse_vals,5)))

plot( complexities, apse_vals[3,], xlab="Degree", ylab="", type='l',
      ylim=c(0, 1.5), col="firebrick", lwd=2 )
lines(complexities, apse_vals[2,], col="steelblue", lwd=2 )
lines(complexities, apse_vals[1,], col="purple", lwd=2)

We pick a degree-5 polynomial based on cross-validation. Notice that the difference between complexities in this neighbourhood is negligible. One could also extend the polynomial degrees beyond 10 in this plot to better visualize the increase in APSE when the complexity is too large.

  1. Perform \(B=100\) bootstrap replications by resampling the errors to quantify the uncertainty in the complexity parameter. Construct a histogram of the complexity parameters to visualize this uncertainty, and comment on the plot.

Solution:

muhats4 <- getmuhat(sample.data, complexity = 4)

y.hat = muhats4(sample.data$x)
R = sample.data$y - y.hat

B = 100
n = nrow(sample.data)
complexities = 0:10
z = numeric(B)
boot.data = sample.data

set.seed(341)
for (i in 1:B) {
 #boot.data = sample.data[sample(n, n, replace=TRUE),]
 boot.data$y = y.hat + R[sample(n, n, replace=TRUE)]
 boot.muFun = getmuFun(boot.data, "x", 'y')
 kfold.boot = sample.kfold(k=5, pop=boot.data, "x", "y")
 
 apse.boot <- sapply(complexities, 
      FUN = function(complexity){
          apse_all(kfold.boot$Ssamples, kfold.boot$Tsamples, 
              complexity = complexity, mu = boot.muFun) })
z[i] = complexities[ which.min(apse.boot[2,]) ]  # which.min guards against ties
}

hist(z, breaks=0:12 -1/2, xlab="Model Complexity", main="Histogram of Model Complexity", prob=TRUE)

We see that a degree-3 polynomial is what this analysis really suggests, and the variability around this value is quite small. Reading the APSE plot, there does not seem to be a significant difference between the APSE of a degree-3 polynomial and that of a degree-5 one.
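The histogram can also be summarized numerically. A sketch, using a hypothetical vector z of selected complexities in place of the one produced by the bootstrap loop above:

```r
# hypothetical bootstrap selections; in practice use the z from the loop above
z <- c(rep(3, 62), rep(4, 21), rep(2, 9), rep(5, 8))
prop.table(table(z))          # share of bootstrap samples selecting each degree
quantile(z, c(0.025, 0.975))  # a rough 95% percentile interval for the complexity
```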

\(\;\)

\(\;\)

\(\;\)

\(\;\)

2.4 Bootstrap for Robust Regression

Suppose that the Animals Data is a sample and we are interested in a linear relationship between log(Brain) and log(Body).

library(MASS)
data(Animals2)

plot(log(Animals2$body), log(Animals2$brain), 
     pch=19, col=adjustcolor("grey", .5),
     xlab="log(body)", ylab="log(Brain)")
  1. Fit a linear function to the given sample using robust regression with the Huber \(\psi\) function.
library(robustbase)
## Warning: package 'robustbase' was built under R version 3.5.2
library(MASS)
## Warning: package 'MASS' was built under R version 3.5.2
 data(Animals2)
 #Animals2 = Animals2[-c(63,64,65),]
x = log(Animals2$body)
y  = log(Animals2$brain)
n =length(y)
plot(x,y, pch=19, col=adjustcolor("grey", .5),
     xlab="log(body)", ylab="log(Brain)")
beta.hat = rlm(y ~ x, psi="psi.huber")$coef

abline(coef=beta.hat, col=adjustcolor("firebrick", 1))
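The psi="psi.huber" argument makes rlm downweight observations with large residuals rather than discard them. As a side illustration (not part of the fit itself), MASS::psi.huber with its default deriv = 0 returns the weight \(\min(1, k/|u|)\) attached to a scaled residual \(u\):

```r
library(MASS)
# scaled residuals within the default cutoff k = 1.345 keep full weight 1;
# larger residuals are downweighted by the factor k/|u|
psi.huber(c(0.5, 1.345, 5), k = 1.345)  # 1.000 1.000 0.269
```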

  1. Using \(B=1000\) bootstrap samples by sampling the pairs and fit a linear function to each sample using robust regression and the huber function.

\(\;\)

  1. Generate a scatterplot of the data along with all the robust regression lines using shaded lines.

Obtaining the bootstrap samples.

B = 1000
m = 1000
set.seed(341)

beta.boot = t(sapply(1:B, FUN =function(b)  
  rlm(y~ x, subset=sample(n,n, replace=TRUE), psi="psi.huber", maxit = 200)$coef ))

Plotting the results.

xseq = seq(-6, 12, length.out=100)
plot(x, y, pch=19, col=adjustcolor("firebrick", 0.3),
     xlab="log(body)", ylab="log(Brain)")

invisible(apply(beta.boot, 1, function(coef, x) {
     lines(x, coef[1] + x*coef[2], col= adjustcolor("grey", 0.1) )
     }, x=xseq))
points(x, y, pch=19, col=adjustcolor("firebrick", 0.3))

  • Varying the shading allows us to see the most common fits among all of them.
  • In the identical plot but without changes in shade, we cannot see where the fits overlap the most, and we lose an important piece of information in assessing adequacy of the model.

\(\;\)

  1. Obtain a 99% confidence interval (using the percentile method) for the robust regression line when \(x=-1\)
a <- 0.01
boot.fit1 = apply(beta.boot, 1, function(coef, x) {
     coef[1] + x*coef[2]
     }, x=-1)

quantile(boot.fit1, probs=c(a/2, 1-a/2))
##     0.5%    99.5% 
## 1.175522 1.785296
  1. Obtain a 99% confidence interval (using the percentile method) for the robust regression line when \(x=10\).
a <- 0.01
boot.fit2 = apply(beta.boot, 1, function(coef, x) {
     coef[1] + x*coef[2]
     }, x=10)

quantile(boot.fit2, probs=c(a/2, 1-a/2))
##     0.5%    99.5% 
## 6.576325 9.846972
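Comparing the widths of the two intervals shows how the uncertainty grows as we move away from the bulk of the data (most log(body) values lie well below 10):

```r
# interval endpoints copied from the two quantile outputs above
diff(c(1.175522, 1.785296))  # width about 0.61 at x = -1, near the data
diff(c(6.576325, 9.846972))  # width about 3.27 at x = 10, far from most data
```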
  1. Obtain a 99% confidence interval (using the percentile method) for the robust regression line.
# computing quantile
xseq = seq(-6, 12, length.out = 50)

boot.fit = apply(beta.boot, 1, function(coef, x) {
     coef[1] + x*coef[2]
     }, x=xseq)

plot(x, y, pch=19, col=adjustcolor("firebrick", 0.6),
     xlab="log(body)", ylab="log(Brain)" )

ci = apply(boot.fit, 1, quantile, c(a/2, 1-a/2))

lines(xseq, ci[1,], lty=2, lwd=2, col=1 )
lines(xseq, ci[2,], lty=2, lwd=2, col=1 )
lines(xseq, beta.hat[1] + xseq*beta.hat[2], col=1)

points(x, y, pch=19, col=adjustcolor("firebrick", 0.6) )

\(\;\)

\(\;\)

\(\;\)

  1. Repeat part b) by resampling the errors to generate the bootstrap samples.

\(\;\)

  1. Generate a scatterplot of the data along with all the robust regression lines using shaded lines.

Obtaining the bootstrap samples.

m = 1000
set.seed(341)

y.hat = beta.hat[1] + beta.hat[2]*x
R = y - y.hat
nonpar.boot.sam = Map(function(b)  
  { Rstar = R[sample(n,n,replace=TRUE)]
    y = beta.hat[1] + beta.hat[2]*x + Rstar
    data.frame( x = x, y= y )  } , 1:B)

beta.boot = t(sapply(nonpar.boot.sam, function(sam) {  
  rlm(y~ x, data=sam, psi="psi.huber", maxit = 100 )$coef } ))

Plotting the results.

plot(x, y, pch=19, col=adjustcolor("firebrick", 0.3),
     xlab="log(body)", ylab="log(Brain)" )

invisible(apply(beta.boot, 1, function(coef, x) {
     lines(x, coef[1] + x*coef[2], col= adjustcolor("grey", 0.1) )
     }, x=xseq))
points(x, y, pch=19, col=adjustcolor("firebrick", 0.3))

  • Varying the shading allows us to see the most common fits among all of them.
  • In the identical plot but without changes in shade, we cannot see where the fits overlap the most, and we lose an important piece of information in assessing adequacy of the model.

\(\;\)

  1. Obtain a 99% confidence interval (using the percentile method) for the robust regression line when \(x=-1\)
a <- 0.01
boot.fit1 = apply(beta.boot, 1, function(coef, x) {
     coef[1] + x*coef[2]
     }, x=-1)

quantile(boot.fit1, probs=c(a/2, 1-a/2))
##     0.5%    99.5% 
## 1.086172 1.712245
  1. Obtain a 99% confidence interval (using the percentile method) for the robust regression line when \(x=10\).
a <- 0.01
boot.fit2 = apply(beta.boot, 1, function(coef, x) {
     coef[1] + x*coef[2]
     }, x=10)

quantile(boot.fit2, probs=c(a/2, 1-a/2))
##     0.5%    99.5% 
## 8.449129 9.749525
  1. Obtain a 99% confidence interval (using the percentile method) for the robust regression line.
# computing quantile
xseq = seq(-6, 12, length.out = 50)

boot.fit = apply(beta.boot, 1, function(coef, x) {
     coef[1] + x*coef[2]
     }, x=xseq)

plot(x, y, pch=19, col=adjustcolor("firebrick", 0.6),
     xlab="log(body)", ylab="log(Brain)" )

ci = apply(boot.fit, 1, quantile, c(a/2, 1-a/2))

lines(xseq, ci[1,], lty=2, lwd=2, col=1 )
lines(xseq, ci[2,], lty=2, lwd=2, col=1 )
lines(xseq, beta.hat[1] + xseq*beta.hat[2], col=1)

\(\;\)

\(\;\)

\(\;\)

\(\;\)

2.5 Weekly temperature data for Cairo

To explore the nature of the predictive accuracy of various polynomials, we will use some weekly average air temperatures in Cairo. Load the file cairo_temp.csv

cairo.temp = read.csv("cairo_temp.csv")
head(cairo.temp)
##   week temp
## 1    1 59.2
## 2    2 56.5
## 3    3 55.7
## 4    4 56.1
## 5    5 58.4
## 6    6 58.5
names(cairo.temp) = c('x', 'y')
  1. Generate the scatter plot of the data and overlay fitted polynomials with degrees 2 and 10 to the data.
library(splines)  # needed for the bs() spline basis used below

getmuhat <- function(sampleXY, complexity = 1) {
  formula <- paste0("y ~ ",
                    if (complexity==0) "1"
                    else {
                      if (complexity < 3 ) {
                        ## orthogonal polynomials (raw = FALSE)
                        ## avoid numerical overflow
                        paste0("poly(x, ", complexity, ", raw = FALSE)") 
                      } else {   
                        ## if complexity >= 3 use a spline basis
                        paste0("bs(x, ", complexity, ")") 
                      }
                    }  
  )
  
  fit <- lm(as.formula(formula), data = sampleXY)
  tx = sampleXY$x
  ty = fit$fitted.values
  
  range.X = range(tx)
  val.rY  = c( mean(ty[tx == range.X[1]]), 
               mean(ty[tx == range.X[2]]) )
  
  ## From this we construct the predictor function
  muhat <- function(x){
    if ("x" %in% names(x)) {
      ## x is a dataframe containing the variate named
      ## by xvarname
      newdata <- x
    } else 
      ## x is a vector of values that needs to be a data.frame
    { newdata <- data.frame(x = x) }
    ## The prediction
    ## 
    suppressWarnings({ 
      ypred = predict(fit, newdata = newdata)    })
    ypred[newdata$x < range.X[1]] = val.rY[1]
    ypred[newdata$x > range.X[2]] = val.rY[2]
    ypred
  }
  ## muhat is the function that we need to calculate values 
  ## at any x, so we return this function from getmuhat
  muhat
}
muhat2  <- getmuhat(cairo.temp, 2)
muhat10 <- getmuhat(cairo.temp, 10)

xlim <- extendrange(cairo.temp[,1])

plot(cairo.temp,
     pch=19, col= adjustcolor("black", 0.5))
curve(muhat2, from = xlim[1], to = xlim[2], 
      add = TRUE, col="red", lwd=2)
curve(muhat10, from = xlim[1], to = xlim[2], 
      add = TRUE, col="steelblue", lwd=2)
title(main="red=degree 2 , blue=degree 10")

  1. Generate \(m=50\) samples of size \(n=50\). Fit polynomials of degree 2 and 10 to every sample.
getSampleComp <- function(pop, size, replace=FALSE) {
  N <- dim(pop)[1]
  samp <- rep(FALSE, N)
  samp[sample(1:N, size, replace = replace)] <- TRUE
  samp
}


### This function will return a data frame containing
### only two variates, an x and a y
getXYSample <- function(xvarname, yvarname, samp, pop) {
  sampData <- pop[samp, c(xvarname, yvarname)]
  names(sampData) <- c("x", "y")
  sampData
}
N_S <- 50
set.seed(341)  # for reproducibility

n= 50
samps    <- lapply(1:N_S, FUN= function(i){getSampleComp(cairo.temp, n)})
Ssamples <- lapply(samps, FUN= function(Si){getXYSample("x", "y", Si, cairo.temp)})
Tsamples <- lapply(samps, FUN= function(Si){getXYSample("x", "y", !Si, cairo.temp)})

muhats2 <- lapply(Ssamples, getmuhat, complexity = 2)
#getmubar(muhats2)

muhats10 <- lapply(Ssamples, getmuhat, complexity = 10)
#mubar10 <- getmubar(muhats10)
  1. Using par(mfrow=c(1,2)) plot all the fitted polynomials with degree 2 and 10 on two different figures. Overlay the two fitted polynomials of degree 2 and 10 based on the whole population (make the colour of the population curves different from the others to make them stand out).
par(mfrow=c(1,2))

xvals <- seq(xlim[1], xlim[2], length.out = 200)
plot(cairo.temp, 
     pch=19, type='n',
     xlab="x", ylab="predictions",
     main= " muhats (degree = 2) & mubar")


for (i in 1:N_S) {
  curveFn <- muhats2[[i]]
  curve(curveFn, from = xlim[1], to = xlim[2], add=TRUE, col=adjustcolor("blue", 0.2), lwd=3, lty=(1))
}

curve(muhat2,  from = xlim[1], to = xlim[2],
      add=TRUE, col="firebrick", lwd=3)

points(cairo.temp, 
     pch=19, col= adjustcolor("black", 0.5))


plot(cairo.temp, 
     pch=19, type='n',
     xlab="x", ylab="predictions",
     main= " muhats (degree = 10) & mubar")

for (i in 1:N_S) {
  curveFn <- muhats10[[i]]
  curve(curveFn, xlim[1], xlim[2], add=TRUE, col=adjustcolor("blue", 0.2), lwd=3, lty=1)
}

curve(muhat10, xlim[1], xlim[2], add=TRUE, col="firebrick", lwd=3)

points(cairo.temp, 
     pch=19, col= adjustcolor("black", 0.5))

  1. Using the var_mutilde function, calculate the sampling variability of the fitted polynomials with degree equal to 2 and 10.
getmubar <- function(muhats) {
  function(x) {
    Ans <- sapply(muhats, FUN=function(muhat){muhat(x)})
    apply(Ans, MARGIN=1, FUN=mean)
  }
}
ave_y_mu_sq <- function(sample, predfun, na.rm = TRUE){
  mean((sample$y - predfun(sample$x))^2, na.rm = na.rm)
}
 
###########

ave_mu_mu_sq <- function(predfun1, predfun2, x, na.rm = TRUE){
  mean((predfun1(x) - predfun2(x))^2, na.rm = na.rm)
}

###########


var_mutilde <- function(Ssamples, Tsamples, complexity){
  ## get the predictor function for every sample S
  muhats <- lapply(Ssamples, 
                   FUN=function(sample){
                     getmuhat(sample, complexity)
                   }
  )
  ## get the average of these, mubar
  mubar <- getmubar(muhats)
  
  ## average over all samples S
  N_S <- length(Ssamples)
  mean(sapply(1:N_S, 
              FUN=function(j){
                ## get muhat based on sample S_j
                muhat <- muhats[[j]]
                ## average over x_i in a single
                ## sample T_j the squares
                ## (muhat(x) - mubar(x))^2
                T_j <- Tsamples[[j]]
                ave_mu_mu_sq(muhat, mubar, T_j$x)
              }
  )
  )
}
var_mutilde(Ssamples, Tsamples, complexity=2)
## [1] 2.973982
var_mutilde(Ssamples, Tsamples, complexity=10)
## [1] 2.953363
  • The sampling variability is roughly the same for the two degrees.
  1. Using bias2_mutilde function, calculate the squared bias of the polynomials with degree equal to 2 and 10.
bias2_mutilde <- function(Ssamples, Tsamples, mu, complexity){
  ## get the predictor function for every sample S
  muhats <- lapply(Ssamples, 
                   FUN=function(sample) getmuhat(sample, complexity)
  )
  ## get the average of these, mubar
  mubar <- getmubar(muhats)
  
  ## average over all samples S
  N_S <- length(Ssamples)
  mean(sapply(1:N_S, 
              FUN=function(j){
                ## average over x_i in a single
                ## sample T_j the squares
                ## (mubar(x) - mu(x))^2
                T_j <- Tsamples[[j]]
                ave_mu_mu_sq(mubar, mu, T_j$x)
              }
  )
  )
}
muhat = getmuFun(cairo.temp, "x", 'y')
#cairo.temp$y - muhat(cairo.temp$x)

bias2_mutilde(Ssamples, Tsamples, muhat, complexity=2)
## [1] 96.19658
bias2_mutilde(Ssamples, Tsamples, muhat, complexity=10)
## [1] 13.86907
  1. Generate \(m=50\) samples of size \(n=50\), and using the apse_all function, calculate the APSE for complexities equal to 0:10.
    • Summarize the results with a table and a graphical display.
    • Give a conclusion.
complexities <- 0:15
muhat = getmuFun(cairo.temp, "x", 'y')


apse_vals <-  sapply(complexities, 
                     FUN = function(complexity){
                       apse_all(Ssamples, Tsamples, 
                                complexity = complexity, mu = muhat)
                     }
)

# Print out the results

round( t(rbind(complexities, apse=round(apse_vals,5))),1)
##       complexities  apse var_mutilde bias2 var_y
##  [1,]            0 112.7         1.3 109.5     0
##  [2,]            1 115.4         2.5 109.2     0
##  [3,]            2 103.2         3.0  96.2     0
##  [4,]            3 105.5         4.0  96.1     0
##  [5,]            4  56.5         4.6  49.0     0
##  [6,]            5  65.1         7.5  52.9     0
##  [7,]            6  24.1         5.6  16.4     0
##  [8,]            7  22.4         4.0  16.2     0
##  [9,]            8  19.9         3.2  14.5     0
## [10,]            9  19.5         3.1  14.0     0
## [11,]           10  19.3         3.0  13.9     0
## [12,]           11  19.5         3.2  13.6     0
## [13,]           12  20.1         3.6  13.5     0
## [14,]           13  20.5         4.0  13.2     0
## [15,]           14  20.7         4.3  13.0     0
## [16,]           15  21.3         4.7  12.9     0
par(mfrow=c(1,2))
plot( complexities, apse_vals[1,], xlab="Degree", ylab="", type='l', ylim=c(0, max(apse_vals) ), col="purple", lwd=2 )
lines(complexities, apse_vals[2,], col="blue", lwd=2 )
lines(complexities, apse_vals[3,], col="red", lwd=2)
lines(complexities, apse_vals[4,], col="black", lwd=2)

# The increase in apse is too sharp in higher complexities. Let's zoom in a bit
zoom = 7:15
plot( complexities[zoom], apse_vals[1, zoom], xlab="Degree", ylab="", type='l', ylim=c(0, max(apse_vals[,zoom])  ), col="purple", lwd=2 )
lines(complexities[zoom], apse_vals[2, zoom], col="blue", lwd=2 )
lines(complexities[zoom], apse_vals[3, zoom], col="red", lwd=2)
lines(complexities[zoom], apse_vals[4, zoom], col="black", lwd=2)

  • The polynomial with degree 10 has the lowest APSE.
  • The variance of the predictor function increases for higher degree polynomials.
  • The bias starts off high and decreases sharply until about degree 8, after which it levels off.
  • The main source of APSE is the bias of the lower-degree polynomials.
muhat10 <- getmuhat(cairo.temp, 10)

xlim <- extendrange(cairo.temp[,1])

plot(cairo.temp,
     pch=19, col= adjustcolor("black", 0.5),
     main="Best fit (degree 10)")
curve(muhat10, from = xlim[1], to = xlim[2], 
      add = TRUE, col="red", lwd=2)

  • The visual assessment further confirms a decent fit for the degree-10 polynomial.
  • Outside the range of the sample, the fitted function is held constant at the boundary fitted values, so it differs there from a raw degree-10 polynomial.
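The flat behaviour outside the data comes from the last lines of getmuhat above, which overwrite predictions outside the observed x-range with the fitted values at the nearest boundary. A toy version of that rule, with made-up numbers:

```r
# hypothetical predictions and boundary fitted values, mirroring getmuhat
ypred   <- c(2, 5, 9)    # raw polynomial predictions at the query points
xnew    <- c(-1, 3, 12)  # query points; -1 and 12 fall outside the data range
range.X <- c(0, 10)      # observed x-range
val.rY  <- c(2.5, 8.5)   # fitted values at the two boundaries
ypred[xnew < range.X[1]] <- val.rY[1]  # clamp left of the data
ypred[xnew > range.X[2]] <- val.rY[2]  # clamp right of the data
ypred  # 2.5 5.0 8.5
```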

\(\;\)

\(\;\)

\(\;\)

  1. Instead of randomly constructing sample and test sets we can use k-fold cross-validation.

  2. Create a function that creates the \(k\)-fold samples from a given population, i.e.
    • The function has arguments k: the number of folds, pop: a population, xvarname: the name of the x variable, and yvarname: the name of the y variable.
    • The function outputs a list containing the k-fold training and test samples labelled as Ssamples and Tsamples.
    • The function rep_len might be helpful.
sample.kfold <- function(k=NULL, pop=NULL, xvarname=NULL, yvarname=NULL ) {
  
  N = nrow(pop)
  kset = rep_len(1:k, N)
  kset = sample(kset)
  
  samps = list()
  for (i in 1:k) { 
    samps[[i]] = logical(N)
    samps[[i]][kset != i] = TRUE
  }

set.seed(341)
Ssamples <- lapply(samps, 
                   FUN= function(Si){getXYSample(xvarname, yvarname, Si, pop)})

Tsamples <- lapply(samps, 
                   FUN= function(Si){getXYSample(xvarname, yvarname, !Si, pop)})

    list(Ssamples=Ssamples,Tsamples=Tsamples)
}
  1. Use the function from part b) and the apse function to find an estimate of the APSE using \(k=10\)-fold cross-validation when complexity=2.
cairo.temp.muFun = getmuFun(cairo.temp, "x", 'y')
kfold.samples = sample.kfold(k=10, pop=cairo.temp, "x", "y")
apse_all(kfold.samples$Ssamples, kfold.samples$Tsamples, complexity = 2, mu = cairo.temp.muFun)
##        apse var_mutilde       bias2       var_y 
## 100.5682473   0.3267217  95.5191848   0.0000000
  1. Perform \(k=10\)-fold cross-validation to estimate the complexity parameter from the set \(0:15\). Plot APSE by the complexity.
complexities <- 0:15


apse_vals <- sapply(complexities, 
      FUN = function(complexity){
          apse_all(kfold.samples$Ssamples, kfold.samples$Tsamples, 
              complexity = complexity, mu = cairo.temp.muFun) })

# Print out the results
t(rbind(complexities, apse=round(apse_vals,5)))
##       complexities      apse var_mutilde     bias2 var_y
##  [1,]            0 110.26060     0.08139 108.69017     0
##  [2,]            1 112.40437     0.21025 108.41738     0
##  [3,]            2 100.56825     0.32672  95.51918     0
##  [4,]            3 102.27981     0.43858  95.50643     0
##  [5,]            4  52.59967     0.54551  49.52610     0
##  [6,]            5  67.57154     1.05166  59.52932     0
##  [7,]            6  16.96717     1.21988  15.15112     0
##  [8,]            7  21.17531     0.80229  17.83607     0
##  [9,]            8  16.96280     0.63725  14.62305     0
## [10,]            9  15.99556     0.32656  13.97877     0
## [11,]           10  15.96780     0.29304  13.78307     0
## [12,]           11  15.59129     0.32778  13.36052     0
## [13,]           12  15.54063     0.32948  13.21977     0
## [14,]           13  15.56432     0.32155  13.24166     0
## [15,]           14  16.11326     0.45189  13.14555     0
## [16,]           15  16.18850     0.74751  12.68697     0
par(mfrow=c(1,2))
plot( complexities, apse_vals[1,], xlab="Degree", ylab="", type='l', ylim=c(0, max(apse_vals) ), col="purple", lwd=2 )
lines(complexities, apse_vals[2,], col="blue", lwd=2 )
lines(complexities, apse_vals[3,], col="red", lwd=2)
lines(complexities, apse_vals[4,], col="black", lwd=2)

# The increase in apse is too sharp in higher complexities. Let's zoom in a bit
zoom = 7:15
plot( complexities[zoom], apse_vals[1, zoom], xlab="Degree", ylab="", type='l', ylim=c(0, max(apse_vals[,zoom])  ), col="purple", lwd=2 )
lines(complexities[zoom], apse_vals[2, zoom], col="blue", lwd=2 )
lines(complexities[zoom], apse_vals[3, zoom], col="red", lwd=2)
lines(complexities[zoom], apse_vals[4, zoom], col="black", lwd=2)

  • We pick a degree-12 polynomial, which attains the smallest cross-validated APSE.
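The selected degree can be read off programmatically; a minimal check using the APSE column copied from the table above (which.min returns the first minimum in case of ties):

```r
# cross-validated APSE for degrees 0:15, copied from the table above
apse <- c(110.26060, 112.40437, 100.56825, 102.27981, 52.59967, 67.57154,
          16.96717, 21.17531, 16.96280, 15.99556, 15.96780, 15.59129,
          15.54063, 15.56432, 16.11326, 16.18850)
complexities <- 0:15
complexities[which.min(apse)]  # 12
```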